ASYMPTOTIC BEHAVIOR OF k-WORD MATCHES BETWEEN TWO UNIFORMLY DISTRIBUTED SEQUENCES
Authors
Abstract
Given two sequences of length n over a finite alphabet A of size |A| = d, the D2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of D2 when the letters are uniformly distributed over the alphabet. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein’s method, we show that, for large enough k, the D2 statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the D2 statistic is approximately normal as n gets large. We also give a formula for the variance of D2 in the uniform case.
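As an illustration of the definition above, the following is a minimal Python sketch of how D2 could be computed. It is not taken from the paper; the function names `kmer_counts`, `d2`, `d2_naive` and `expected_d2_uniform` are hypothetical. It relies on the fact that counting matching pairs of positions is the same as summing, over every k-letter word, the product of that word's counts in the two sequences. The last function assumes the uniform i.i.d. model of the abstract, under which each of the (n - k + 1)^2 position pairs matches with probability d^(-k).

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-letter word (overlapping occurrences included) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """D2 as the sum over words w of count_a(w) * count_b(w)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for a word that is absent, so missing words add nothing.
    return sum(c * counts_b[w] for w, c in counts_a.items())


def d2_naive(seq_a, seq_b, k):
    """D2 straight from the definition: compare every pair of positions."""
    return sum(
        seq_a[i:i + k] == seq_b[j:j + k]
        for i in range(len(seq_a) - k + 1)
        for j in range(len(seq_b) - k + 1)
    )


def expected_d2_uniform(n, k, d):
    """Mean of D2 for two length-n sequences of i.i.d. uniform letters
    over an alphabet of size d (assumption: the uniform model)."""
    return (n - k + 1) ** 2 / d ** k


if __name__ == "__main__":
    a, b = "ACGTACGT", "CGTTACGA"
    print(d2(a, b, 3), d2_naive(a, b, 3))
    print(expected_d2_uniform(len(a), 3, 4))
```

The count-based version touches each sequence once and then only the words that actually occur, whereas the naive version compares all pairs of positions; both return the same value.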
Related articles
Distributional regimes for the number of k-word matches between two random sequences.
When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. No positional information is used in the count, but it has the virtue that the comparison time is linear with sequence length. For this reason this statistic D(2) and certain transformations of D(2) are used for EST sequence database searches. In this paper we begin the ri...
The Distribution of Short Word Match Counts between Markovian Sequences
The D2 statistic, which counts the number of word matches between two given sequences, has long been proposed as a measure of similarity for biological sequences. Much of the mathematically rigorous work carried out to date on the properties of the D2 statistic has been restricted to the case of ‘Bernoulli’ sequences composed of identically and independently distributed letters. Here the proper...
Approximate Word Matches between Two Random Sequences
Given two sequences over a finite alphabet L, the D2 statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D2 statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For k <m, we look at the count of m-lett...
APPROXIMATE WORD MATCHES BETWEEN TWO RANDOM SEQUENCES By Conrad
Given two sequences over a finite alphabet L, the D2 statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D2 statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For k <m, we look at the count of m-lett...
Upper and Lower Class Sequences for Minimal Uniform Spacings
In this paper we investigate the asymptotic behavior of the k-th smallest uniform spacing. Among other things, a complete characterization of upper and lower class sequences is obtained. The asymptotic behavior is similar in many respects to that of the minimum of independent uniformly distributed random variables. Let X1 , . . . ,X , be independent identically distributed uniform (0, 1) random...
Journal title:
Volume, issue:
Pages: -
Publication date: 2007